This report explores the Prosper Loan data set, which contains 113,937 loans with 81 variables per loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
## [1] 113937 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Let’s look at the distribution of loans according to their terms in months.
##
## 12 36 60
## 1614 87778 24545
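To put these counts in proportion, the share of each term can be computed directly from the table. This is a small sketch; the vector name `term_counts` is only illustrative and the counts are copied from the output above.

```r
# Counts copied from the Term table above (months -> number of loans).
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Share of each term in percent, rounded to one decimal place.
term_share <- round(100 * term_counts / sum(term_counts), 1)
print(term_share)
```

Roughly three out of four loans run for 36 months, while 12-month loans are rare.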
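It is also interesting to see how the loan amounts are distributed (plotted with bin sizes of $1,000 and $5,000). Loans are concentrated at the lower end of the range, although the summary below shows a median of $6,500, so fewer than half of the loans are under $5,000.
Loans range from $1,000 to a maximum of $35,000. The mean ($8,337) lies above the median ($6,500), which indicates a right-skewed distribution.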
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Applying a log scale to the histogram does not reveal any signs of normality.
The distribution of borrower annual rates shows that most rates lie between roughly 0.15 and 0.20 (15% to 20%).
Loans per year/month
LoanOriginationDate is stored as a timestamp string (a factor in the structure output above), so we extract the year and the month into separate columns to plot the time series and seasonality. Most loans were taken out in 2013. The number of loans also rises over the course of the year, and most people take out a loan in October and December. It would be interesting to analyse how high the total amount of all loans was per year and what the average is. Year 2014 was removed from the monthly view because it was not complete when the dataset was created.
Average and total loan amounts per year: the total for 2014 is not quite as high as for 2013, which is plausible because 2014 was not yet over when the data was collected. The 2014 average, however, is so far still higher than in 2013.
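The extraction of year and month can be sketched as follows. This is a minimal example, assuming the timestamp format shown for `LoanOriginationDate` in the structure output; the second timestamp and the names `origination` and `loans` are made up for illustration and are not taken from the original analysis code.

```r
# Example timestamps in the format shown for LoanOriginationDate above
# (the second value is invented for illustration).
origination <- factor(c("2005-11-15 00:00:00", "2013-10-01 00:00:00"))

# Convert factor -> character -> Date, keeping only the date part.
origination_date <- as.Date(substr(as.character(origination), 1, 10))

# Year as an integer, month as a two-digit string ("01" to "12").
loans <- data.frame(
  Year  = as.integer(format(origination_date, "%Y")),
  Month = format(origination_date, "%m")
)
print(loans)
```

Keeping the month as a two-digit string makes it behave as a categorical variable, which suits a seasonality plot.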
The bulk of the Prosper score distribution lies in the middle: most people get average scores.
Looking at the usage categories of loans, more than 50% are used for debt consolidation.
Most of the recipients are employed, but many of them do not specify the type of their employment. It would be interesting to find out how long each type of employment status lasts. Many employment statuses were empty; these were changed to ‘Not available’ in order to clean the data.
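One way to fold the empty values into ‘Not available’ is to rename the empty factor level. The sketch below uses a toy factor rather than the real column, and the variable name `status` is hypothetical.

```r
# Toy factor mimicking EmploymentStatus, including the empty level "".
status <- factor(c("", "Employed", "Not available", "", "Full-time"))

# Renaming the empty level to an already existing level merges the two.
levels(status)[levels(status) == ""] <- "Not available"
print(table(status))
```

Because ‘Not available’ already exists as a level, the assignment merges both groups instead of creating a duplicate.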
The income ranges are not ordered, which needs to be fixed. There is also a group labelled ‘Not employed’. One could consider merging this group with the $0 group, but it is unclear whether ‘Not employed’ borrowers really have no income, as they may have other income sources (e.g. stocks, rental income, …).
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
Looking at the distribution of income ranges, most people earn between $25,000 and $49,999. The second-largest group ranges from $50,000 to $74,999. It would be interesting to see how the different income ranges are composed in terms of employment status, which we will examine in the multivariate part of the analysis.
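The reordering of the income ranges might look like the sketch below. The counts are copied from the table above; the chosen order (non-numeric groups first, then the brackets in ascending order) is an assumption, as the report does not state the exact order used.

```r
# Counts copied from the IncomeRange table above (alphabetical order).
income_counts <- c("$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
                   "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
                   "$75,000-99,999" = 16916, "Not displayed" = 7741,
                   "Not employed" = 806)

# Assumed target order: non-numeric groups first, then ascending brackets.
income_order <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                  "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                  "$100,000+")
ordered_counts <- income_counts[income_order]
print(ordered_counts)

# The same vector could be passed to factor() to reorder the column itself:
# df$IncomeRange <- factor(df$IncomeRange, levels = income_order)
```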
Open vs. current credit lines
Both open and current credit lines have right-skewed distributions.
Home ownership is split roughly 50-50, with slightly more homeowners.
## False True
## 56459 57478
Most borrowers have high bankcard utilisation. A utilisation above 1.0 would mean that the balance exceeds the card's credit limit.
Most people take loans as individuals.
The dataset contains 113937 observations of 81 features.
BorrowerAPR
BorrowerRate
LenderYield
ProsperScore
EmploymentStatus
EmploymentStatusDuration
BankcardUtilization
IncomeRange
LoanOriginalAmount
LoanOriginationDate
Term
ListingCategory
IsBorrowerHomeowner
CurrentlyInGroup
CurrentDelinquencies
DelinquenciesLast7Years
PublicRecordsLast12Months
PublicRecordsLast10Years
AvailableBankcardCredit
CurrentCreditLines
OpenCreditLines
Year
Month
Employment Status contained empty values, which were changed to “Not available” to match the existing level of that name. Income Range was unordered, so its levels were reordered and the $100,000+ range was moved to its correct position. The date column could not be used directly, so Year and Month were extracted into separate columns. Listing categories were converted from numeric codes to strings so that the categories are easier to read in the plots.
The “Not employed” distribution falls off much more steeply than those of “Employed” and, for example, “Full-time”, which suggests that most people are fortunately unemployed for only a relatively short time. Part-time employment also does not seem to last as long as full-time employment, whose distribution has its bulk further to the right.
The borrower APR decreases as the Prosper score increases. The correlation is moderate and negative, at about -0.67.
##
## Pearson's product-moment correlation
##
## data: df$ProsperScore and df$BorrowerAPR
## t = -261.68, df = 84851, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6719940 -0.6645469
## sample estimates:
## cor
## -0.6682872
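As a consistency check, the t statistic reported by the test can be reproduced from the correlation and the degrees of freedom, using the standard relation t = r * sqrt(df / (1 - r^2)) for the Pearson test. The values below are copied from the output above; the variable names are illustrative.

```r
# Values copied from the test output above.
r  <- -0.6682872   # sample correlation
df <- 84851        # degrees of freedom (n - 2)

# t statistic of the Pearson correlation test.
t_stat <- r * sqrt(df / (1 - r^2))
print(round(t_stat, 2))
```

With this many observations even small correlations yield very large t values, so the size of the correlation is more informative here than the p-value.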
Borrower rate (unsurprisingly) has a highly linear relation to lender yield; however, some values do not lie on the line. We will analyse this in the multivariate analysis.
##
## Pearson's product-moment correlation
##
## data: df$BorrowerRate and df$LenderYield
## t = 8493.9, df = 113940, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9992021 0.9992204
## sample estimates:
## cor
## 0.9992113
There is a weak negative correlation (about -0.32) between loan amount and APR.
##
## Pearson's product-moment correlation
##
## data: df$BorrowerAPR and df$LoanOriginalAmount
## t = -115.14, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3280787 -0.3176752
## sample estimates:
## cor
## -0.3228867
Mean APR had its highest point in 2011 and then decreased again.
##
## Pearson's product-moment correlation
##
## data: df$Year and df$BorrowerAPR
## t = 21.946, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05910109 0.07066652
## sample estimates:
## cor
## 0.06488598
Most borrowers have less than 50 Credit lines and less than 25 delinquencies in the past 7 years.
People with higher bankcard utilisation tend to have higher loan rates.
##
## Pearson's product-moment correlation
##
## data: df$BankcardUtilization and df$BorrowerAPR
## t = 88.323, df = 106330, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2558295 0.2670290
## sample estimates:
## cor
## 0.261438
People with higher incomes tend to borrow larger amounts. Only people earning over $100k got loans above $25k.
Borrowers with higher incomes also get better annual rates. The low rates for people with an income of $0 could be explained by many students in this group who get cheap student loans.
Prosper Score depends on income. For incomes below $50k, the average score stays the same. It is also the same for ranges from $50k to $75k. The average is highest above $100k.
On log scale, delinquencies become better visible. However, they do not really vary visibly among income ranges.
It turns out that people with higher income on average get higher amounts and better rates for loans.
Loan rates increased until 2011 and then decreased again. Maybe this is a consequence of the financial crisis around 2008.
BorrowerRate, LenderYield -> 0.999
BorrowerAPR, ProsperScore -> -0.67
BorrowerAPR, LoanOriginalAmount -> -0.32
As “Not employed” and “$0” are not the same kind of category, the employment statuses were analysed to see how the different groups are composed.
Borrower rate is almost perfectly linear in lender yield. However, if no Prosper score is available, the lender yield seems to deviate from the ideal margin line, as can be seen in the following two plots. The upper one has the NA Prosper Score values removed, while the lower one includes them as grey dots.
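The first five rows printed in the structure output at the top of this report already hint at this pattern. The sketch below recomputes the margin between borrower rate and lender yield for those rows; the values are the rounded ones as printed, and five rows are of course no proof of the general pattern.

```r
# First five rows as printed (rounded) in the structure output above.
first_rows <- data.frame(
  ProsperScore = c(NA, 7, NA, 9, 4),
  BorrowerRate = c(0.158, 0.092, 0.275, 0.0974, 0.2085),
  LenderYield  = c(0.138, 0.082, 0.240, 0.0874, 0.1985)
)

# Margin between what the borrower pays and what the lender receives.
first_rows$Margin   <- round(first_rows$BorrowerRate - first_rows$LenderYield, 4)
first_rows$HasScore <- !is.na(first_rows$ProsperScore)
print(first_rows)
```

In these rows the loans with a score all have a margin of 0.01, while the two loans without a score have larger and differing margins.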
Relationship between BorrowerAPR, Loan Amount and Score shows that APR slightly decreases when amount increases and when score increases.
Long-term loans (60 months) tend to have higher amounts than shorter loans (36 months). Short-term loans (12 months) tend to have the lowest amounts.
When comparing Income Ranges in terms of loan amounts, it can be seen that the bulks of the high income ranges are more to the right side (higher loans) and the low income ranges on the left side (lower loans).
Bank card utilisation per income range shows that higher income ranges have a larger bulk at high credit card utilisation rates.
Mean loan amount per year per income range again clearly shows that higher income ranges get higher loans. There is also no value for $0 in 2014. This group is not very realistic, which may be why it was dropped. “Not displayed” shows a lack of data before 2007.
Relationship between borrower rate, term and APR
The different terms form clusters in this plot based on their rate/APR ratio.
Lower income is connected to lower employment duration.
Lower income ranges have much higher APRs. After 2012 these even increased, while APRs for higher income ranges decreased. The lowest income groups no longer appear in the data after 2013.
Linear model for BorrowerAPR
##
## Calls:
## m1: lm(formula = BorrowerAPR ~ ProsperScore, data = df)
## m2: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount,
## data = df)
## m3: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year, data = df)
## m4: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus, data = df)
## m5: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization, data = df)
## m6: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange,
## data = df)
## m7: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange +
## Month, data = df)
## m8: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange +
## Month + Term, data = df)
## m10: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange +
## Month + Term + CurrentDelinquencies + DebtToIncomeRatio,
## data = df)
## m11: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange +
## Month + Term + CurrentDelinquencies + DebtToIncomeRatio +
## CurrentlyInGroup, data = df)
## m12: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount +
## Year + EmploymentStatus + BankcardUtilization + IncomeRange +
## Month + Term + CurrentDelinquencies + DebtToIncomeRatio +
## CurrentlyInGroup + IsBorrowerHomeowner, data = df)
##
## ============================================================================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m10 m11 m12
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 0.360*** 0.377*** 42.134*** 52.814*** 54.260*** 54.059*** 56.838*** 58.612*** 57.799*** 57.906*** 58.707***
## (0.001) (0.001) (0.325) (0.369) (0.363) (0.363) (0.396) (0.397) (0.418) (0.419) (0.416)
## ProsperScore -0.022*** -0.020*** -0.023*** -0.023*** -0.022*** -0.022*** -0.022*** -0.022*** -0.021*** -0.021*** -0.021***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## LoanOriginalAmount -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## Year -0.021*** -0.026*** -0.027*** -0.027*** -0.028*** -0.029*** -0.029*** -0.029*** -0.029***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## EmploymentStatus: Full-time/Employed -0.040*** -0.042*** -0.042*** -0.045*** -0.045*** -0.044*** -0.044*** -0.043***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## EmploymentStatus: Not employed/Employed 0.026*** 0.028*** -0.002 -0.002 -0.002 -0.043 -0.038 -0.029
## (0.002) (0.002) (0.008) (0.007) (0.007) (0.048) (0.048) (0.047)
## EmploymentStatus: Other/Employed 0.002* 0.003*** -0.000 0.000 0.001 -0.000 -0.001 0.001
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## EmploymentStatus: Part-time/Employed -0.041*** -0.041*** -0.050*** -0.053*** -0.053*** -0.050*** -0.050*** -0.050***
## (0.003) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003)
## EmploymentStatus: Retired/Employed -0.030*** -0.030*** -0.034*** -0.036*** -0.035*** -0.036*** -0.035*** -0.034***
## (0.003) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003)
## EmploymentStatus: Self-employed/Employed -0.021*** -0.018*** -0.018*** -0.018*** -0.017*** -0.029*** -0.029*** -0.029***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.007) (0.007) (0.007)
## BankcardUtilization 0.032*** 0.034*** 0.035*** 0.034*** 0.036*** 0.036*** 0.038***
## (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
## IncomeRange: $1-24,999/$0 -0.016* -0.016* -0.016*
## (0.007) (0.007) (0.007)
## IncomeRange: $100,000+/$0 -0.032*** -0.031*** -0.030*** -0.012*** -0.012*** -0.006***
## (0.007) (0.007) (0.007) (0.001) (0.001) (0.001)
## IncomeRange: $25,000-49,999/$0 -0.028*** -0.027*** -0.028*** -0.010*** -0.010*** -0.008***
## (0.007) (0.007) (0.007) (0.001) (0.001) (0.001)
## IncomeRange: $50,000-74,999/$0 -0.033*** -0.032*** -0.032*** -0.014*** -0.014*** -0.010***
## (0.007) (0.007) (0.007) (0.001) (0.001) (0.001)
## IncomeRange: $75,000-99,999/$0 -0.033*** -0.032*** -0.032*** -0.014*** -0.014*** -0.009***
## (0.007) (0.007) (0.007) (0.001) (0.001) (0.001)
## Month: 02/01 -0.003*** -0.003*** -0.003*** -0.003*** -0.003***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 03/01 -0.003*** -0.004*** -0.004*** -0.004*** -0.004***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 04/01 -0.001 -0.003*** -0.004*** -0.004*** -0.004***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 05/01 -0.004*** -0.006*** -0.007*** -0.007*** -0.007***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 06/01 -0.004*** -0.007*** -0.008*** -0.008*** -0.008***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 07/01 -0.004*** -0.007*** -0.007*** -0.007*** -0.007***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 08/01 -0.007*** -0.009*** -0.010*** -0.010*** -0.010***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 09/01 -0.005*** -0.006*** -0.007*** -0.007*** -0.007***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 10/01 -0.008*** -0.009*** -0.009*** -0.009*** -0.009***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 11/01 -0.012*** -0.013*** -0.013*** -0.013*** -0.013***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Month: 12/01 -0.018*** -0.019*** -0.018*** -0.019*** -0.019***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Term 0.001*** 0.001*** 0.001*** 0.001***
## (0.000) (0.000) (0.000) (0.000)
## CurrentDelinquencies 0.005*** 0.005*** 0.005***
## (0.000) (0.000) (0.000)
## DebtToIncomeRatio 0.007*** 0.007*** 0.008***
## (0.001) (0.001) (0.001)
## CurrentlyInGroup: True/False -0.006*** -0.006***
## (0.001) (0.001)
## IsBorrowerHomeowner: True/False -0.012***
## (0.000)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.447 0.513 0.592 0.612 0.625 0.627 0.631 0.637 0.639 0.639 0.645
## adj. R-squared 0.447 0.513 0.592 0.612 0.625 0.627 0.631 0.636 0.639 0.639 0.645
## sigma 0.059 0.056 0.051 0.050 0.049 0.049 0.049 0.048 0.048 0.048 0.047
## F 68477.863 44693.580 41078.950 14843.372 14157.563 9519.616 5584.057 5502.654 4904.649 4737.888 4690.494
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 119107.488 124531.466 132064.086 134126.052 135649.676 135878.977 136325.808 136946.168 126186.902 126199.599 126793.657
## Deviance 299.890 263.900 220.970 210.487 203.062 201.968 199.852 196.951 175.361 175.304 172.639
## AIC -238208.976 -249054.931 -264118.171 -268230.104 -271275.352 -271723.954 -272595.616 -273834.336 -252313.804 -252337.197 -253523.315
## BIC -238180.930 -249017.537 -264071.428 -268127.269 -271163.168 -271565.027 -272333.853 -273563.224 -252036.041 -252050.176 -253227.034
## N 84853 84853 84853 84853 84853 84853 84853 84853 77557 77557 77557
## ============================================================================================================================================================================================================================
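To make the coefficients more tangible, the simplest model (m1) can be evaluated by hand. This is a rough sketch using the rounded m1 coefficients as printed in the table, so the results are approximate; the example scores are arbitrary.

```r
# Rounded m1 coefficients as printed in the table above.
intercept <- 0.360
slope     <- -0.022

# Predicted BorrowerAPR for a few example Prosper scores.
scores        <- c(1, 5, 10)
predicted_apr <- intercept + slope * scores
print(data.frame(ProsperScore = scores, PredictedAPR = predicted_apr))
```

Under m1, each additional score point lowers the predicted APR by about 2.2 percentage points.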
The multivariate analyses make it even clearer that higher incomes go along with larger loans and lower rates. The plots also visualise how rates and amounts changed over the years. Credit card utilisation patterns may also vary between income groups.
Prosper did not differentiate income ranges before 2007 and has no borrowers with zero income or “Not employed” status as of 2013 (maybe because loans were no longer granted to this group).
The multiple linear regression model takes 12 different variables into account and achieves an R^2 of 0.645 for predicting BorrowerAPR. It only considers linear relationships between the variables, so its performance could be improved by modelling non-linear relations. Further feature engineering, more thorough cleaning of the data set and additional variables could also improve performance.
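The stepwise construction of nested models can be sketched as follows. The example uses R's built-in `mtcars` data as a stand-in, not the Prosper data, and the chosen predictors are purely illustrative; only `lm()`, `update()` and `summary()` from base R are used.

```r
# Stand-in data: the built-in mtcars data set, NOT the Prosper loans.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- update(m1, . ~ . + hp)            # add a numeric predictor
m3 <- update(m2, . ~ . + factor(cyl))   # add a categorical predictor

# Collect R-squared and adjusted R-squared for each nested model.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_summary <- data.frame(
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared)
)
print(round(fit_summary, 3))
```

When models are nested and fitted to the same rows, R^2 cannot decrease as predictors are added, which is why adjusted R^2 and information criteria are shown as well. Note that in the table above N drops from 84853 to 77557 from m10 onward, presumably because of missing values in the added predictors, so log-likelihood, AIC and BIC are not directly comparable across that boundary.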
Most loans are taken out towards the end of the year. This is quite interesting and can have several reasons. The most intuitive one is that people have plans for the following year and therefore borrow money. Another explanation might be that people run out of money towards the end of the year and need to borrow.
The BorrowerAPR depends strongly on the income range. A higher income makes a better rate more likely, while a lower income increases the chance of a worse rate. The $0 group is an exception. It could include students who get cheaper student loans.
This plot shows the chronological sequence of the APR based on the borrowers’ income ranges. It can be seen that the overall rate increases until 2011 and subsequently drops, except for low earners and unemployed borrowers who cease to exist in the data set as of 2013.
This data analysis examined the behaviour of borrowers based on various features. Insight was gained especially into the borrowers’ employment statuses, income ranges, credit card utilisation and use of loans. Finally, a model was built to predict borrowers’ rates based on 12 of their characteristics.
The dataset was challenging to some extent. As it includes approx. 80 variables, a lot of work initially went into understanding the dataset and looking at the different variables. Running ggpairs on a short list of features was especially helpful, and the spreadsheet also helped in getting a better grasp of the different features.
Additionally, some of the struggles included dealing with features that were not internally consistent, e.g. income range, which included “Not employed” even though this should have been part of “Employment Status” only. Also, the values of “Employment Status” are not all of the same kind. While “Employed” and “Not employed” are opposites, “Full-time” and “Part-time” are subsets of “Employed” and cannot be directly compared with “Employed”.
On the other hand, it was surprising to see how the different features influence the BorrowerAPR (e.g. Score and Income Range). It was also surprising that low-income and unemployed borrowers no longer appear in the data set as of 2013.
Further cleaning, feature engineering and the exploration of more variables could yield further insights and more powerful models. For example, one could explore which effect the estimated values (e.g. EstimatedLoss, EstimatedReturn, EstimatedEffectiveYield …) have on the model, or what impact variables such as Occupation or CurrentlyInGroup have. One could also analyse whether the performance of the model changes when the pre-2007 values, which do not have differentiated income ranges, are removed. Further features could be engineered and used for prediction, e.g. ratios between features such as credit lines per delinquency. Finally, one could analyse whether non-linear terms or higher-order functions of the features produce better model results.